The Polygraph Place



  Polygraph Place Bulletin Board
  Professional Issues - Private Forum for Examiners ONLY
  PCSOT AFMGQT


Author Topic:   PCSOT AFMGQT
Taylor
Member
posted 05-18-2008 08:17 AM
e - I never heard this presentation you mentioned. You said, 'I recall a presentation by Don Krapohl a few years ago where he broke some very bad news about AFMGQT---that it was not the stellar test I'd thought---having run a (what felt like) gazillion of them up til that point.'

Can you provide a bit more info for me, as I kinda like the AFMGQT? Taylor


rnelson
Member
posted 05-18-2008 08:47 PM
Here is a theoretical discussion of the problems of the MGQT, or more properly, the GQT and the family of MGQTs. The Federal Psychophysiological Detection of Deception Examiner's Handbook of 10/2006 uses the more general term Comparison Test Formats (CTF).

Historically, it seems that Reid and Backster developed two distinct but largely similar versions of the comparison question technique, based on the interpretation of differential reactivity and differences in the saliency of relevant and comparison questions, while recording physiological information about respiration, cardiovascular activity, and electrodermal activity.

Reid evolved his technique rather directly from the RI technique, by replacing the Irrelevant (neutral) Questions at positions 6 and 10 with Comparison Questions. Ansley (1998) described some of the evolution from the Reid technique to the Army Zone, including the use of Backster-type exclusive comparison questions, and numerical scoring.

Backster seems to have developed the Zone Comparison Technique (ZCT or Zone) from his evaluation of the theoretical problems surrounding polygraph testing, including the need for comparison questions, the desire to understand the distinctness or separation between the comparison and relevant questions, the possibility of outside influence and other factors.

Other variants of the MGQT have emerged, including the highly psychologized Marcy Technique and the Air Force MGQT. A previous edition of the AAPP polygraph examiner handbook included descriptions of the Navy MGQT and Secret Service MGQT, the latter of which seems now to be regarded as a form of Air Force MGQT.

Both Reid and Backster have indoctrinated and imprinted generations of polygraph professionals, some of whom may be very competent, and some of whom have attempted to create and validate their own parochial versions, or schools of thought, around these basic methods.

The synchronicity of the evolution of new techniques is not unique. In the field of statistics, we saw the emergence of identical nonparametric methods known as the Mann-Whitney U test and the Wilcoxon Rank-Sum test. These seem to have been published in different scientific journals within a short time of each other, and the historical consensus is that there was most likely no communication or plagiarism involved. It is simply a matter of timing, and the momentum of interest in nonparametric alternatives to the parametric two-sample t-test. These days, the technique is sometimes called the MWW U test or the Mann-Whitney-Wilcoxon rank sum test, in reference to all the people who did work on its development.

At their core, there are few differences between the types of test questions, physiological data, and basic scoring principles. Even the distinction between exclusive and non-exclusive CQs is now viewed as silly by people who have studied the data pertaining to these questions. Primary differences exist in target selection, question formulation, and decision policies.

It is interesting to note that the trend of evolution from research and the study of data has tended to simplify our use and understanding of these polygraph techniques, while the improvements from trainers in field settings have involved more theoretical or psychologized solutions to potential problems. The intended function of symptomatic questions has generally not withstood the challenges of inquiry (Honts, Amato, & Gordon, 2000; Krapohl & Ryan, 2001), nor has the overall truth question (Hilliard, 1979). Inside track questions have not been subject to any realistic investigation of the null hypothesis.

I'll have to check, when I get home, but I believe Honts and Raskin (2002) describe the primary differences in Zone and MGQT techniques in terms of target selection and question formulation: the Zone approach has been to select the most intense behavioral issue or allegation, construct several semantically identical, though linguistically distinct, questions about that concern, and then interpret the test as a single issue. The MGQT approach is to construct several distinct investigation targets regarding distinct behavioral aspects of a single event or allegation, and interpret the questions with the hope that they will inform the investigator about the nature, role, or degree of involvement in the event or allegation. MGQT exams are, therefore, multi-facet examinations when used in the context of a known event or allegation.

Backster calls the Zone an "exploratory" test when it involves multiple facets of a known incident. Under these circumstances the test is essentially an MGQT, and it causes misunderstanding and unnecessary division to insist on idiosyncratic language. It also degrades the primary value of any technique to deviate from an emphasis on its strengths, in the false hope that a single technique can become a panacea for all polygraph complications.

Differences in decision policies for Zone and MGQT exams exist primarily in the emphasis on interpreting the total score, or the test as a whole, versus attempts to reach an overall test result through the interpretation of results for individual test questions.

While most polygraph test developers seem to have been unfamiliar with the commonly understood statistical complications inherent in conducting multiple simultaneous significance tests, these differences can be observed in the common trends related to the outcomes of Zone and MGQT exams.

The results of single-issue Zone exams are easy to understand, because, by definition, it is inconceivable that an examinee could lie to one relevant question while being truthful to others. The results of multi-facet examinations become semantically and mathematically more complex, because the construction of these exams appears to be intended to improve the specificity of the test to the examinee's role of involvement in the known issue. Semantically, the axiom "DI to one means DI to all" is not so easily endorsed. This becomes even more evident when the MGQT format is adapted for use in mixed-issue screening examinations, which commonly investigate several distinct behavioral concerns in the absence of any known allegations. In mixed-issue exams it is conceivable that a person could lie regarding involvement in one behavior while being truthful regarding involvement in other behaviors.

The testing advantage of the mixed-issue screening exam is intended to be increased sensitivity to a broader range of behavioral issues. However, Barland, Honts, & Barger (1989) showed that mixed-issue examinations may not in fact provide the high level of sensitivity hoped for. There are understandable reasons for this phenomenon.

Consider that all measurements are in fact estimates, and contain information from several dimensions, including the actual value of the phenomenon we seek to measure, along with random measurement error resulting from psychological variability, physiological variability, and component sensor imperfections. For this reason, it is well understood in the social and medical sciences that more stable estimates can be achieved by taking several measurements, and then aggregating the measurements together through addition, averaging, standardization, or some other transformation. It is for this reason that proper scientific testing involves taking those several measurements in the same manner. For example: if I measure my son's linear height, it would contribute to a more stable measurement if I measured his height two or three times, perhaps at the beginning and end of a single day. It would not, however, contribute to a more stable measurement if I had him stand on his toes for one of those measurements, simply because I want to maximize the linear distance from the floor. The point isn't to seek the maximum distance, but a stable estimate of the true distance.

To use a metaphor (simile, actually): if we wanted to sight in a hunting rifle, we might put three rounds onto a single target, and then make some careful judgment about any necessary adjustment, based on the pattern on the target. That pattern is intended to be a representation of the variance of the weapon. It would make no sense to place the first round on the target from a prone position, the second while seated, and the third standing or kneeling. Doing so introduces an added dimension of variability to the data. Proper sighting is achieved by placing all rounds on the target using the same position/stance, grip, sight-picture, and trigger control.
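
To make the height example concrete, here is a minimal simulation sketch (my illustration, with invented numbers): averaging repeated measurements taken the same way shrinks the random error, while taking one of them differently adds variability and bias instead of reducing it.

```python
# Minimal simulation sketch (invented numbers): averaging repeated
# measurements taken the SAME way shrinks random error; taking one of
# them differently adds variability and bias instead.
import numpy as np

rng = np.random.default_rng(42)
true_height = 150.0   # hypothetical true height, cm
noise_sd = 0.5        # random measurement error, cm

# Three measurements taken the same way: the mean is a more stable estimate.
same_way = true_height + rng.normal(0.0, noise_sd, size=(10000, 3))
print(same_way.mean(axis=1).std())    # ~0.29 cm, i.e., 0.5 / sqrt(3)

# Same data, but the third measurement is taken "on tiptoes":
mixed = same_way.copy()
mixed[:, 2] += rng.normal(2.0, 1.0, size=10000)
print(mixed.mean(axis=1).std())       # ~0.44 cm: more spread, not less
print(mixed.mean() - true_height)     # ~+0.67 cm: and now biased upward
```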

If we imagine the single-issue Zone exam as an attempt to shake apples from a tree (assuming there may be apples in the tree), we then get to stimulate or shake a single tree three different ways, and we get to complete the tree-shaking/stimulation experiment three or more times, for a total experiment involving nine or more opportunities to shake apples from one tree. If we shake the tree vigorously enough, we are assured apples will fall if they are present. Having shaken the tree vigorously, and to our satisfaction, nine total times, we can feel reasonably assured that no apples are present if nothing has fallen.

In contrast, multi-facet examinations represent an experiment in which we have up to four different trees to shake, when evaluating for the presence or absence of apples. One major difference that is readily understood is that we will not be shaking a single tree nine times. Instead, we can now shake each tree only three to five times. The total volume of stimulation applied to each tree is therefore reduced, and this may be the condition underlying the results observed by Barland, Honts, & Barger (1989).

So far, I have attempted to illustrate the sensitivity limitations. There are also specificity limitations with the technique. As you know, specificity, in polygraph, is the ability to correctly classify truthful cases. Deficits in specificity are related to both inconclusive results and false-positive errors.

Test accuracy, mathematically speaking, is really about variance. Accuracy of multi-facet and mixed-issue examinations is about the variance of the individual spots. Attempts to define spot scores in terms of linear portions of total scores are mathematically negligent. Hint: it's all about variance.

The challenges of the MGQT format also involve how we go about interpreting the collection of several spot results. MGQT scoring rules provide a straightforward method: fail any question and you fail the test; pass every question to pass the test. That also means any inconclusive question will produce an inconclusive test result, unless one or more of the questions produces a deceptive score. Never mind that most, if not all, of the research on MGQT decision cutscores is empirical only, with no real attempt to evaluate the variance of spot scores. It's still about variance, and the statistical rules that drive test results still drive the results of MGQT exams.
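
In code, that decision policy looks something like the sketch below (my illustration; the +/-3 spot cutscores are placeholders, not a published standard):

```python
# Sketch of the traditional MGQT spot-rule decision policy described
# above. The +/-3 spot cutscores are illustrative placeholders.
def mgqt_decision(spot_scores, di_cut=-3, ndi_cut=+3):
    """Return a test result from per-question (spot) scores."""
    if any(s <= di_cut for s in spot_scores):
        return "DI"   # fail any question -> fail the test
    if all(s >= ndi_cut for s in spot_scores):
        return "NDI"  # pass every question -> pass the test
    return "INC"      # any unresolved question -> inconclusive overall

print(mgqt_decision([+4, +5, -4, +3]))  # DI
print(mgqt_decision([+3, +4, +5, +3]))  # NDI
print(mgqt_decision([+3, +1, +4, +3]))  # INC
```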

Now consider a base-rate problem. For this example, consider the MGQT format used in a LE screening exam. MGQTs make good LEPET and PCSOT formats, because the spot-scoring rules easily adapt to mixed-issue situations. I know, someone will say a LEPET is not an MGQT; just think about it. So, pick three or four high base-rate behaviors. For "high," let's say that >25% of persons are involved in the behavior or attempt to engage in deception regarding the behavior at their polygraph exams. I don't have any exact numbers, but let's imagine that 25% of applicants may under-report their use or recency of involvement in each of: 1) illegal drug use, 2) criminal activities, 3) violence, and 4) lying in their application/disclosure materials. Personally, I feel the last issue (lying to the booklet) is a terrible question, but I understand some agencies' desire to consider it a relevant investigation target. Is it reasonable to imagine that perhaps 25% of all applicants may be involved in each of those activities? If so, then the corollary is that 75% of all applicants are not involved in each. Combining them means that .75 * .75 * .75 * .75 = .32. That is, only 32% of your applicants are not involved in any of those activities. In a perfect setting, this means that 68% of LE applicants should fail their polygraphs. If we are seeing something different, it is a result of differing base rates, or adjustments to decision thresholds.
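
The arithmetic, spelled out (a trivial sketch of the numbers above, assuming the four issues are independent):

```python
# The base-rate arithmetic above, assuming the four issues are independent.
p_clean_per_issue = 0.75   # 75% not involved in any single issue
issues = 4

p_clean_all = p_clean_per_issue ** issues
print(round(p_clean_all, 2))       # 0.32 -> ~32% clean on all four issues
print(round(1 - p_clean_all, 2))   # 0.68 -> ~68% should fail at least one
```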

Now, factor in some imperfections. If we estimate a 5 to 10% error rate, we have to add that estimate for each distinct relevant target (not the case with a single-issue Zone). So, .05 + .05 + .05 + .05 = .20, or .10 + .10 + .10 + .10 = .40. Our estimated FP error rate therefore climbs quickly to 20 to 40 percent.

Then look at the mathematical estimation for inconclusive results, which can be computed as 1 - (1 - p)^k, where p is the per-question inconclusive rate and k is the number of relevant questions.

So, if we estimate a 5 to 10% inconclusive rate per question, then we can anticipate an inconclusive rate somewhere between 19 and 35% for four questions.
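
Here is that compounding worked out (my sketch; the additive figures above are the quick upper-bound version of the same idea):

```python
# Chance of at least one bad outcome across k independent relevant
# questions, each with per-question rate p: 1 - (1 - p)**k.
# (The additive figures above are the upper-bound approximation p * k.)
def at_least_one(p, k=4):
    return 1 - (1 - p) ** k

for p in (0.05, 0.10):
    print(p, round(at_least_one(p), 2), round(p * 4, 2))
# 0.05 -> 0.19 compounded (0.20 additive)
# 0.10 -> 0.34 compounded (0.40 additive)
```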

So, we have a test which may not be as sensitive as we hope, and which has some capacity for returning false-positive results, and a bunch of INCs.

There are more effective ways of aggregating the results of multi-facet and mixed-issue exams, but that involves math that would put many field examiners to sleep. Procedural rules can be developed to approximate those mathematical solutions in a more expedient manner, and that is what Senter (2003), and I think Senter & Dollins (2003), provide when they suggest that 2-stage rules may be more effective for MGQT exams than traditional MGQT decision policies.
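
For illustration, a 2-stage rule might look something like this (my paraphrase of the idea, with hypothetical cutscores; consult Senter's papers for the actual rules): decide on the grand total first, and consult spot scores only when the total is unresolved.

```python
# Hypothetical sketch of a 2-stage decision rule: stage 1 decides on the
# grand total; stage 2 applies the spot rule only if stage 1 was
# inconclusive. Cutscores here are placeholders, not Senter's published ones.
def two_stage_decision(spot_scores, total_di=-6, total_ndi=+6, spot_di=-3):
    total = sum(spot_scores)
    if total <= total_di:
        return "DI"    # stage 1: grand total decides
    if total >= total_ndi:
        return "NDI"
    if any(s <= spot_di for s in spot_scores):
        return "DI"    # stage 2: spot rule recovers sensitivity
    return "INC"

print(two_stage_decision([+2, +3, +2, +1]))  # NDI on the total (+8)
print(two_stage_decision([+3, +2, -4, +1]))  # DI via the stage-2 spot rule
```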

Doubt this yet? Just ask an experienced examiner, who is not trying to BS you with some shell-game confidence-job. In reality, experienced examiners may learn techniques to mitigate or minimize these problems, but professional expertise of that type is largely informal and difficult to quantify.

So, let's look at the results of some studies:

Blackwell (1999) used confirmed ZCT (single-issue) examinations to compare experienced federal examiners to the Polyscore 3.3 algorithm, and reported that the experienced examiners showed a sensitivity level of .908. Sensitivity to deception is the proportion of deceptive cases scored correctly, here calculated with inconclusives (7.7%) retained. Specificity to truthfulness was reported as .543; specificity is the proportion of confirmed truthful cases correctly classified, again with inconclusives (20.0%) retained. Polygraph examiners often prefer to see results with inconclusives removed, because the results are more flattering. However, sensitivity and specificity with inconclusives provide a clearer representation of the effectiveness of a test. This does NOT mean that INCs are errors; they are simply a fact of life.
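
The difference between the two reporting conventions is easy to see in code (the counts below are invented, chosen only to reproduce the cited proportions):

```python
# Two reporting conventions: rates with INCs kept in the denominator vs.
# with INCs excluded (the more flattering number). Counts are invented,
# chosen only to be consistent with the .908 sensitivity / .077 INC above.
def rates(correct, inconclusive, total):
    with_inc = correct / total
    without_inc = correct / (total - inconclusive)
    return round(with_inc, 3), round(without_inc, 3)

print(rates(59, 5, 65))  # (0.908, 0.983) -- same data, two different looks
```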

Blackwell (1999) also evaluated the results of experienced federal examiners with confirmed MGQT cases, and reported sensitivity = .975, with INC = .025, and specificity = .30, along with FPs = .45 and INCs = .25. If you haven't noticed, those numbers look a lot like the math I showed a few paragraphs back. Krapohl (2006) cited other studies (Krapohl & Norris, 2000; Podlesny & Truslow, 1993) with similar numbers.

The thing to keep in mind is that the 1999 study was completed with traditional ZCT scoring rules, including the spot scoring rule. Senter (2001; 2003) showed that the spot scoring rules improve sensitivity to deception, though at a cost of increased FP errors, and suggested a procedural solution to mitigate those errors. Krapohl (2005) provided additional evidence of the 2-stage solution for gaining the benefits of the SSR increase in sensitivity while reducing INCs and FPs. In a statistical model, we would use some fancy math to achieve the same objective, and that is what OSS-3 does. 2-stage rules are a really good example of a simple procedural solution that approximates the mathematical principle. There may be other unexplored opportunities to improve the MGQT and other techniques through procedural or decision policy adjustments. Exploiting those opportunities and optimizing our methods will require that we be willing to tolerate scrutiny around identifiable deficits, and disengage from any pretense that any of our present techniques is a panacea or optimal alternative for all situations.

I'm not sure Eric's statements are about the AFMGQT, but may be about the MGQT in general. Perhaps Eric can clarify this.

There is still a lot for us all to learn.

.02

r

------------------
"Gentlemen, you can't fight in here. This is the war room."
--(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)



J.B. McCloughan
Administrator
posted 05-18-2008 11:26 PM
Ray,

Great post.

The Army MGQT, which I am assuming you were referring to as a ZCT, is also commonly referred to as the DI test.

As for the AFMGQT, the only apparent differences between it and a ZCT format are the absence of "symptomatics" and the recommended cutting scores.


dkrapohl
Member
posted 05-19-2008 03:24 PM

The presentation I'd made some time ago regarding the problems with the MGQT related to the Army version, not the Air Force version. The research picture on the Air Force version is incomplete, but because it entails several defensible practices we can be less worried about it. We see from a lab study by Cullen and Bradley a couple years ago that the question sequence I-R-C produces more negative scores than the sequence I-C-R. In other words, if an irrelevant question is placed immediately before the relevant question in the sequence, scores are significantly more toward the negative direction than when a relevant question is preceded by a comparison question. In fact, truthful examinees were showing average spot score totals of 0 and below for the I-R-C sequence.

My colleagues and I compiled some field data which showed the very same trend. As you are aware, the Army MGQT has two of its relevant questions immediately preceded by irrelevant questions. We found that the average spot score for a truthful examinee for those two relevant questions was between 0 and +1, consistent with the Cullen and Bradley findings. This trend is made worse by the Army MGQT decision rules: the Army MGQT requires a +3 total for every relevant question. With an average spot score of only 0 to +1 for truthful examinees and the requirement to have a +3 spot score to be called truthful, there is good cause to call the Army MGQT a DI test.

The field (federal) ZCT data we had on hand did not show the suppressed spot scores from truthful examinees, but did have scores from deceptive cases that were just as far below 0 as were those from the Army MGQT. In other words, the ZCT did better with truthful cases with no loss of sensitivity for deceptive cases. Because the question sequence of the AF MGQT looks a lot like a ZCT without symptomatic questions, there is optimism that it will work well. Ray's observations, though, regarding mixed-issue and multi-facet examinations are true regardless of question sequence, and he has the data to show it. It's not just about question sequences.

Speaking of Ray, who may be too modest to mention it: he just submitted a paper to Polygraph that explains the underpinnings of OSS-3, a project into which he put many hundreds of hours (for free) and which produced some impressive results. Here's a teaser: it produced a higher decision accuracy than 9 of 10 human scorers who conducted blind scoring on the same 100 cases. He did it neither for money nor ego, but simply to improve the field. Congrats, Ray.


Don


rcgilford
Member
posted 05-19-2008 07:32 PM
I have a lot of experience with the Army MGQT, and it is referred to as a DI test. Although I still have it in my bag of tricks, I can't recall the last time I used it. There are too many other test formats that are better. The good thing about the Army MGQT is that if you get NDI charts, you can bet they are indeed NDI! I'm not sure Army CID is even using the MGQT anymore because of a false positive problem. Skip can clarify that information.



rcgilford
Member
posted 05-19-2008 07:37 PM
Ray (and anyone else involved in OSS-3),

Thank you. I use it and think it is a great move forward in chart evaluation.


rnelson
Member
posted 05-19-2008 10:57 PM
Thanks rcg and Don,

I've been using the Limestone OSS-3 tool for almost a year, and of course I think it's really great. Sometimes I wonder where it came from, and then I remember that OSS-3 exists because of the 20+ years of work on the part of a lot of smart people (Don included).

All we did was put the existing pieces of technology together: solve some transformation problems that allow us to approximate the distribution of spot scores, build a complete decision model that addresses the combinatoric problems of spot and total scores, train the model by running a few thousand resampled sets of data, get some other data and validate on other types of exams, then lather, rinse, and repeat.
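
For anyone curious what "resampling" means in practice, here is a generic bootstrap sketch (toy data, and emphatically not the OSS-3 implementation): resample the observed scores with replacement, many times, to approximate the sampling distribution a decision model can be trained and validated against.

```python
# Generic bootstrap sketch with toy data -- NOT the OSS-3 implementation.
# Resampling observed scores with replacement approximates the sampling
# distribution that a decision model can be trained and validated against.
import numpy as np

rng = np.random.default_rng(7)
observed_totals = np.array([-9, -4, 7, 12, 3, -6, 10, 5, -2, 8])  # toy scores

boot_means = np.array([
    rng.choice(observed_totals, size=observed_totals.size, replace=True).mean()
    for _ in range(5000)
])
print(round(boot_means.mean(), 2))             # bootstrap mean of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # approximate 95% interval
```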

This has to be the funnest way to learn about scoring algorithms. We've had a lot of encouragement, and it's been very gratifying to be able to build on the work of the true giants in the field.

All things considered, OSS-3 has come together kind of quickly for an unfunded project of this scope. If it weren't for Mark Handler, Don Krapohl, and all of the existing knowledge from others, I would no doubt still be scratching my head wondering what the heck goes on in those scoring algorithms.

Lafayette has, I think, now finished work on OSS-3 in LXSoftware 10.0. They have a really slick tool, with good options and a nice chart artifact/review tool.

A family member read through a draft of the paper we submitted to Polygraph, and told me "if I ever have to take a polygraph, I want number 2," referring to scorer #2, who outperformed the algorithm. We may not know for sure who is number 2, but I'm sure it's not me. I'm also sure most of us are not #2. So that is the remaining mystery:

Who is # 2?

(I don't really want to know.)

r

------------------
"Gentlemen, you can't fight in here. This is the war room."
--(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)



stat
Member
posted 05-19-2008 11:11 PM
Don, didn't you include some research numbers on the AFMGQT at the Midwest Regional Poly Conference in, I believe, 2005? Are you sure? That was where I met you for the first time, actually----but at your presentation you listed several newly revisited formats (the greatest hits) and gave some very discouraging accuracy rates on even the newer formats. Given your standing, experience, and proven loyalty----I was very convinced. Finding the PowerPoint notes will take a team of investigators, as I am moving this week and everything older than my current loaf of bread is in a box. Then again, I am not altogether sure you presented a graphic of those figures, but I am sure I took notes. It was a very impressive presentation----if not a little ego-shattering for many of us. That's probably a good thing, eh?

I am convinced---falsely or truly---that the AFMGQT was included in that informative run-down. Everyone has long known the standard MGQT is a "DI test"---hardly a test at all, really----but when you included the AFMGQT, there were audible gasps of surprise----as, like Taylor, many of us felt that the AFMGQT was a newer and better-designed mousetrap, above the typical "old school maladies" like the R&I, MGQT, and others. It was after that conference that I began theorizing some "out there" approaches, such as the neuro-lock/oathing test, as an attempt at a remedy for what I perceived as being an unreliable method of multi-issue credibility assessment used widely in many fields.
p.s.
I am still convinced that my theories have promise. I predict, without the benefit of a graduate degree, that simpler scoring and statistically better question placement are worthwhile short-term pursuits. But I am afraid we are adding some better and even cool features to a jalopy.




dkrapohl
Member
posted 05-21-2008 01:08 PM

Stat:
I pulled up my archives, and in the folder on the Midwest Regional Polygraph Seminar (2005) I find only three presentations: chart interpretation, comparison of three major scoring systems, and a generic presentation on best practices. In none of those was there anything about decision accuracy for a particular technique. My filing skills being what they are, I won’t swear that I didn’t do another presentation. However I’m fairly confident that I have never presented any accuracy information on the AFMGQT because I’m not certain that any has ever been published. If anyone on this board knows of a study I’ve overlooked, please send a note to set me straight (dkrapohl@aol.com). Don


Taylor
Member
posted 05-21-2008 04:22 PM
E - I think the confusion may be coming from a chart of validated techniques on which the AFMGQT was not listed (Barry just presented on this at AAPP). He said the AFMGQT was not included because there were too many versions of it floating around; he didn't say it was a bad test.



Barry C
Member
posted 05-21-2008 05:10 PM
I suspect it's a good test, yes. (I like it.) It was our own Elmer Criswell who said somebody from DACA told him there were too many versions (2RQ, 3RQ and 4RQ) floating around to know what is actually being used.

